ICOR: improving codon optimization with recurrent neural networks

Background
Because there are 61 sense codons but only 20 standard amino acids, most amino acids are encoded by more than one codon. Although such synonymous codons do not alter the encoded amino acid sequence, their selection can dramatically affect the expression of the resulting protein. Codon optimization of synthetic DNA sequences is important for heterologous expression. However, existing solutions are primarily based on choosing high-frequency codons only, neglecting the important effects of rare codons. In this paper, we propose a novel recurrent-neural-network-based codon optimization tool, ICOR, that aims to learn codon usage bias on a genomic dataset of Escherichia coli. We compile a dataset of over 7,000 non-redundant, high-expression, robust genes that are used for deep learning. The model uses a bidirectional long short-term memory-based architecture, allowing the sequential context of codon usage in genes to be learned. Our tool can predict synonymous codons for synthetic genes toward optimal expression in Escherichia coli.

Results
We demonstrate that sequential context achieved via an RNN may yield codon selection that is more similar to the host genome. Based on computational metrics that predict protein expression, ICOR theoretically optimizes protein expression more than frequency-based approaches. ICOR is evaluated on 1,481 Escherichia coli genes as well as a benchmark set of 40 select DNA sequences whose heterologous expression has been previously characterized. ICOR’s performance is measured across five metrics: the Codon Adaptation Index, GC-content, negative repeat elements, negative cis-regulatory elements, and codon frequency distribution.

Conclusions
The results, based on in silico metrics, indicate that ICOR codon optimization is theoretically more effective in enhancing recombinant expression of proteins than other established codon optimization techniques. Our tool is provided as an open-source software package that includes the benchmark set of sequences used in this study.

Supplementary Information
The online version contains supplementary material available at 10.1186/s12859-023-05246-8.
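Two of the five metrics named in the abstract, the Codon Adaptation Index (CAI) and GC-content, can be illustrated with a minimal Python sketch. It uses the standard definition of CAI as the geometric mean of each codon's relative adaptiveness (its frequency divided by that of the most frequent synonymous codon). This is not the ICOR implementation: the function names are hypothetical, and the `REFERENCE_COUNTS` table contains made-up counts for two amino acids rather than real Escherichia coli codon usage.

```python
import math

# Hypothetical codon counts for two amino acids.
# Illustrative values only; NOT real E. coli codon usage.
REFERENCE_COUNTS = {
    "K": {"AAA": 75, "AAG": 25},
    "L": {"CTG": 50, "CTC": 10, "CTT": 10, "CTA": 5, "TTA": 15, "TTG": 10},
}


def relative_adaptiveness(counts_by_aa):
    """Weight of each codon = its count / count of the most used synonymous codon."""
    weights = {}
    for codon_counts in counts_by_aa.values():
        top = max(codon_counts.values())
        for codon, count in codon_counts.items():
            weights[codon] = count / top
    return weights


def codon_adaptation_index(seq, weights):
    """Geometric mean of codon weights; codons absent from the table are skipped."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    logs = [math.log(weights[c]) for c in codons if c in weights]
    if not logs:
        raise ValueError("no scorable codons in sequence")
    return math.exp(sum(logs) / len(logs))


def gc_content(seq):
    """Fraction of bases that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


WEIGHTS = relative_adaptiveness(REFERENCE_COUNTS)
```

With this toy table, `codon_adaptation_index("AAACTG", WEIGHTS)` scores a sequence built only from the most frequent codons, whereas replacing `AAA` with the rarer synonym `AAG` lowers the score. A purely frequency-based optimizer maximizes this quantity by construction, which is why CAI alone cannot capture the rare-codon effects the abstract highlights.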

Polymerase acidic protein (PA) plays a role in viral RNA transcription and replication. It is from the Influenza A virus. It has been studied in (codon optimization papers).

Human peptide deformylase (hPDF) is a target for cancer therapeutics. However, its expression is not very efficient in E. coli. This serves as a good target for benchmarks as past studies have noted the valuable potential of codon optimization of this gene.

EMG1 aa: 245 bp: 735
EMG1 (Protein: nucleolar protein homolog (S. cerevisiae)) is a human-based recombinant protein expressed in E. coli. This gene encodes a protein that methylates pseudouridine and is an essential eukaryotic protein. In Reference Paper, EMG1 expression levels were analyzed in E. coli.

CDK1 aa: 298 bp: 894
CDK1 (Cyclin Dependent Kinase 1) codes for a protein which is essential for G1/S and G2/M phase transitions. It has been used in calculating prognostic value in human cancer (Reference Paper). Higher expression levels correlated with a more-advanced tumor. Another reference paper analyzing the importance of this gene looked at expression and structure through recombinant expression in E. coli (Reference Study).

Cd80 (Mus musculus CD80 antigen) is a protein-coding gene whose protein, once activated, induces T-cell proliferation and cytokine production. In Reference Paper, Cd80 expression levels were analyzed in E. coli.

Pim-1 oncogene (PIM1) encodes a protein that plays a role in signal transduction in blood cells. The gene is expressed primarily in B-lymphoid and myeloid cell lines. It is found to be overexpressed in hematopoietic malignancies and in prostate cancer. In Reference Paper, PIM1 expression levels were analyzed in E. coli.

FALVAC-1 aa: 324 bp: 972
FALVAC-1 is a vaccine against Plasmodium falciparum. This synthetic gene is a good benchmark for codon optimization tools because it is a real candidate vaccine that is recombinantly produced (Reference Paper).

Mitogen-Activated Protein Kinase 1 (MAPK1) is a part of the MAP kinase signal transduction pathway. It is of special interest because in this Reference Paper, when expressed in E. coli, there was a significant difference in the protein yield for their wild-type and optimized genes: 24.3 and 11.5 mg/L respectively.

"This gene encodes one of at least three opioid receptors in humans; the mu opioid receptor (MOR). The MOR is the principal target of endogenous opioid peptides and opioid analgesic agents such as beta-endorphin and enkephalins. The MOR also has an important role in dependence to other drugs of abuse, such as nicotine, cocaine, and alcohol via its modulation of the dopamine system" (Study). In Reference Paper, OPRM1 expression levels were analyzed in E. coli.

LAMP1 aa: 418 bp: 1254
Lysosomal-associated membrane protein 1 (LAMP1) is a protein-coding gene that is associated with Chediak-Higashi Syndrome and Gaucher's Disease. This is a gene of interest because it may play a role in tumor cell metastasis. It was expressed as a "transcription factor" in this Reference Paper in E. coli; its expression was measured in vivo.

AKT serine/threonine kinase 1 (AKT1), naturally found in humans, encodes one of three kinases which are referred to as protein kinase B alpha, beta, and gamma. In the Reference Paper, the optimized AKT1 gene showed 10% lower heterologous expression than the wild-type gene. This makes it a particularly interesting benchmark gene to gauge the performance of codon optimization tools.

LCK aa: 510 bp: 1530
Lymphocyte-specific protein tyrosine kinase (LCK) is a gene that encodes a protein that is a key signaling molecule "...in the selection and maturation of developing T-cells" (Reference). In Reference Paper, LCK expression levels were analyzed in E. coli. It was found that LCK showed a 500-fold increase in mRNA transcripts for the sequence-optimized gene.

Pseudomonas aeruginosa exotoxin A (PEA) is an important pathogenic factor. It retains high immunogenicity even after detoxification, enabling its use as a vaccine adjuvant and vaccine carrier. This is a good benchmark as it can be produced at low cost and at large scale in E. coli. Further, the (Reference Paper) finds that codon optimization enhances expression of PEA in E. coli; thus, if the tool presented in this research can achieve results similar to or better than those of the paper and/or other approaches, it will be considered an improvement.

Transporter 1, ATP-binding cassette, sub-family B (MDR/TAP) (TAP1) encodes a protein known to be involved in molecular transport and drug resistance. In Reference Paper, TAP1 expression levels were analyzed in E. coli.

Upstream Binding Transcription Factor (UBTF) encodes a protein important for ribosomal RNA transcription. The UBTF studied originates from human genes. UBTF is well-known as a recombinant protein and has been noted as useful as a blocking peptide for certain antibodies.

BRAF1 aa: 767 bp: 2301
BRAF1 (v-raf murine sarcoma viral oncogene homolog B1) is a protein kinase that transduces mitogenic signals from the cell membrane to the nucleus. This kinase phosphorylates MAP2K1, which activates the MAP kinase signal pathway (PubMed: 21441910, PubMed: 29433126). This gene is of interest because it is frequently mutated and allows a cell to become a tumor cell. In Reference Paper, BRAF1 expression levels were analyzed in E. coli.

Mutations in this gene have been associated with osteopoikilosis, Buschke-Ollendorff syndrome and melorheostosis. It was expressed as a "membrane protein" in this Reference Paper in E. coli; its expression was measured in vivo.

MMLP3 aa: 945 bp: 2835
Proteins associated with the MMP family are known to be involved in the breakdown of extracellular matrices in physiological processes within the cell. The MMP3 or MMLP3 gene encodes an enzyme that degrades glycoproteins such as fibronectin. It has been studied in (codon optimization papers).

CEBPZ aa: 1055 bp: 3165
CEBPZ (CCAAT/enhancer binding protein zeta) is a human protein-encoding gene that plays a role in responding to environmental stimuli (heat). In Reference Paper, CEBPZ expression levels were analyzed in E. coli, which makes it a good benchmark gene for comparison.

KIF11 aa: 1057 bp: 3171
KIF11 (Kinesin family member 11) was expressed recombinantly in E. coli in this study. "This gene encodes a motor protein that belongs to the kinesin-like protein family. Members of this protein family are known to be involved in various kinds of spindle dynamics." (KIF11 Gene (Protein Coding)).

NPR1 aa: 1062 bp: 3186
NPR1 (Natriuretic Peptide Receptor 1) is a gene that encodes a peptide receptor that is located in the kidney, lungs, and adipocytes. NPR1 is associated with diseases including congestive heart failure and malt worker's lung. In Reference Paper, NPR1 expression levels were analyzed in E. coli.

Programmed Cell Death 11 (PDCD11) is a useful binding protein that colocalizes with U3 RNA (MIM 180710) in the nucleolus and is required for rRNA maturation and generation. It is important because, as a plasma protein, it is within the nucleolus of the cell and is a ribosomal protein. In Reference Paper, PDCD11 expression levels were analyzed in E. coli.